[00:05.410 --> 00:12.010]  Bonjour, my name is Elie. I work at Google, where I lead the security and anti-abuse research team.
[00:12.010 --> 00:20.070]  Today, with Jean-Michel, who is here in spirit, we are going to tell you how we can use deep learning to reduce side-channel attack surface.
[00:20.170 --> 00:34.390]  Before getting started, I would like to point out that the research you are about to see is part of a larger project, done in collaboration with many Googlers and external collaborators, around hardening hardware cryptography to create more secure devices.
[00:34.390 --> 00:44.850]  Side-channel attacks are one of the most efficient ways to attack secure hardware because, instead of targeting the algorithm, which is usually well understood and well scrutinized (let's say AES),
[00:44.850 --> 00:51.250]  they target the implementation and its interplay with the given hardware.
[00:51.410 --> 01:02.830]  And implementations are much less scrutinized because, A, there are many of them, and B, it is far more subtle to understand how the side effects of code affect specific hardware and how that can be exploited by attackers.
[01:02.830 --> 01:06.570]  Here is a concrete example to show you how powerful side-channel attacks are.
[01:06.570 --> 01:16.030]  Back in 2017, researchers were able to recover Bitcoin private keys out of a Trezor hardware wallet by using side-channel attacks.
[01:16.270 --> 01:29.290]  They showed that despite the algorithm being well reviewed and the hardware being well understood, the interplay between the two still had problems that could be exploited through side-channel attacks.
[01:30.590 --> 01:36.690]  And from the defender side, side-channel leaks are very difficult to deal with because they are very hard to debug and fix.
[01:36.690 --> 01:43.850]  So even if you know you have a side-channel, it's really hard to know where it's coming from and what you can do to fix it.
[01:43.930 --> 01:53.450]  So if side-channel attacks are both very important and hard to debug, there is room for innovation to help with the situation.
[01:53.450 --> 01:55.770]  And that's where this project comes in.
[01:55.770 --> 02:03.750]  And our idea was maybe we can try to develop new technology to accurately pinpoint the code which is vulnerable to side-channel attacks.
[02:03.750 --> 02:13.930]  That way, developers can quickly isolate it, improve the quality of their implementation, and make it more resilient to side-channel attacks.
[02:14.350 --> 02:16.090]  So that was the idea we had.
[02:16.090 --> 02:32.950]  And the way we went about it was to combine deep learning and dynamic analysis to accurately pinpoint the origin of the leakage exploited by a given side-channel attack.
[02:33.190 --> 02:42.710]  And I know, I know what you're thinking: oh my god, one more deep learning talk. It's all going to be hype, etc., etc.
[02:42.710 --> 02:48.870]  Well, actually, no. This talk, as I promised, is a hacker journey, so we want to make it very concrete.
[02:48.930 --> 03:00.250]  And what we really want to do today is showcase to you our debugging tool, which we call SCALD, which stands for side-channel attack leak detector, and how that works in practice.
[03:00.250 --> 03:17.190]  And to make it very concrete, today I'm going to show you how we can use SCALD to debug TinyAES, a very vanilla, plain implementation of AES, running on a well-known CPU, the STM32F4.
[03:17.190 --> 03:40.190]  So we're going to see in practice how you can use the technique we developed to isolate a very clear leakage, and hopefully by the end of the talk you will have a good understanding of what the tool is about and why it might be useful if you work in the field of cryptography and hardware crypto, or if you are simply interested in the topic.
[03:40.190 --> 03:55.330]  So how are we going to go about that? First, I'm going to briefly detail how side-channel attacks work. Then we're going to discuss how AI-based side-channel attacks work, because we need those to be able to pinpoint the leakage.
[03:55.330 --> 04:06.330]  Then I'm going to dive a little bit into what AI explainability is, and how you go about explaining what a machine learning model sees or understands, or how it makes a decision.
[04:06.330 --> 04:22.330]  And finally, we'll bring everything together. I'll talk a little bit about the dynamic analysis part of the project and how we fit it all together to pinpoint the leakage, and I'll show you how it works in practice for our target, TinyAES on the STM32.
[04:22.330 --> 04:29.070]  You can follow along by getting the slides, at the very least, and hopefully the code, at elie.net/code.
[04:29.650 --> 04:47.610]  As I mentioned, the purpose of this talk is to be very practical and down-to-earth: how this type of tool works at a high level and how you can use it as a practitioner, not necessarily how to develop such a technique or how to do research to improve it.
[04:47.610 --> 04:59.530]  If that is something you are interested in, we are working on a research paper, which has all the technical details about the explainability techniques we use, the benchmarking, the alternatives, and all the good stuff.
[04:59.530 --> 05:08.690]  And hopefully the paper will be out shortly, or will already have been put on arXiv by the time you see this recording.
[05:08.690 --> 05:14.350]  With that out of the way, let's start by recapping what side-channel attacks are.
[05:14.350 --> 05:22.330]  At its core, a side-channel attack is an indirect measurement of a computation result using an auxiliary mechanism.
[05:22.330 --> 05:31.960]  So basically, instead of directly observing the result, we try to infer what it is by a third-party route, which is what we call an auxiliary mechanism.
[05:31.960 --> 05:38.220]  Side-channel attacks are used in many, many ways to attack various targets.
[05:38.260 --> 05:44.780]  Obviously, as discussed in this talk, they are used to recover encryption keys from secure hardware implementations.
[05:45.180 --> 05:53.000]  They are also used in web security to perform blind SQL injection, where you cannot see the result of the injected SQL statement.
[05:53.000 --> 05:58.920]  Then, they are also used to steal passwords and PINs from secure implementations.
[05:59.200 --> 06:07.080]  And they are also used, as I mentioned in the early example at the beginning, to recover crypto-wallet private keys.
[06:07.080 --> 06:17.920]  Those are only four examples of the many ways you can use side-channel attacks: basically, when you need to infer something you cannot observe, a side-channel attack is usually the way to go.
[06:17.920 --> 06:20.020]  Let's make that a little bit more concrete for our use case.
[06:20.020 --> 06:31.580]  When you do a cryptographic computation, you feed in a plaintext and a secret key, and the algorithm runs on the CPU.
[06:31.780 --> 06:34.780]  And while it's running, we have some leakage.
[06:34.840 --> 06:37.420]  The first one is how long the computation takes.
[06:37.420 --> 06:45.900]  While it's not very relevant to AES when you have hardware acceleration, depending on the computation, the time might actually change depending on the key.
[06:45.900 --> 06:54.320]  That was mostly used for RSA a long time ago because RSA uses exponentiation, and when it's not constant time, it can actually help to recover the key.
[06:54.320 --> 06:58.820]  That's one of the examples, but again, timing is how long the computation takes.
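To make the RSA timing point concrete, here is a minimal sketch (not a real RSA implementation, and not code from the talk) of textbook left-to-right square-and-multiply modular exponentiation. It performs an extra multiplication only for the 1-bits of the exponent, so the operation count, and therefore the running time, depends on the secret exponent. The function name `modexp_with_count` is hypothetical.

```python
from typing import Tuple


def modexp_with_count(base: int, exponent: int, modulus: int) -> Tuple[int, int]:
    """Textbook left-to-right square-and-multiply (NOT constant time).

    Returns (base**exponent % modulus, number of modular multiplications),
    so the secret-dependent amount of work is visible.
    """
    result = 1
    multiplications = 0
    for bit in bin(exponent)[2:]:  # most significant bit first
        result = (result * result) % modulus  # square for every bit
        multiplications += 1
        if bit == "1":
            # Extra multiply only for 1-bits: this is the timing leak.
            result = (result * base) % modulus
            multiplications += 1
    return result, multiplications


# Same bit length, different number of 1-bits, different amount of work.
print(modexp_with_count(3, 0b1011, 1000))  # (147, 7)
print(modexp_with_count(3, 0b1000, 1000))  # (561, 5)
```

Constant-time variants, such as the Montgomery ladder, perform the same sequence of operations regardless of each exponent bit, which removes this particular signal.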
[06:58.860 --> 07:01.740]  Then we have the one we use in this talk, which is current.
[07:01.740 --> 07:13.440]  Of course, depending on the operation you do, depending on how many registers you load, how many registers you unload, the amount of power consumption will vary from clock to clock.
[07:13.440 --> 07:18.060]  And so that's what we can measure, and that's what we can use to infer what is happening.
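A common simplifying assumption in the side-channel literature (not stated in the talk) is that power consumption at a given instant is roughly proportional to the Hamming weight of the data being processed, plus noise. The sketch below simulates one leakage sample per encryption under that Hamming-weight model, using the first-round value `plaintext XOR key` as the processed data. The function names and the noise level are illustrative assumptions.

```python
import numpy as np


def hamming_weight(values) -> np.ndarray:
    """Number of set bits in each byte."""
    v = np.asarray(values, dtype=np.uint8)
    return np.unpackbits(v[..., None], axis=-1).sum(axis=-1)


def simulate_leakage(key_byte: int, plaintexts, noise_std: float = 0.5, seed: int = 0) -> np.ndarray:
    """One simulated power sample per encryption: HW(plaintext ^ key) + Gaussian noise."""
    rng = np.random.default_rng(seed)
    processed = np.bitwise_xor(np.asarray(plaintexts, dtype=np.uint8), np.uint8(key_byte))
    return hamming_weight(processed) + rng.normal(0.0, noise_std, size=processed.shape)


# Noise-free leakage for plaintext bytes 0..7 under key byte 0x2B.
print(simulate_leakage(0x2B, np.arange(8, dtype=np.uint8), noise_std=0.0))
```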
[07:18.480 --> 07:22.800]  Then we have a third one, which is less used, but it's still possible.
[07:22.800 --> 07:23.840]  It is heat.
[07:23.840 --> 07:32.740]  Depending on which type of operation you do, different parts of the CPU will be used, and as a result, some parts will be hotter than others.
[07:33.260 --> 07:37.500]  It's a little bit arcane, and rarely used, I think.
[07:37.500 --> 07:41.200]  I haven't seen many concrete examples, but it does exist.
[07:41.200 --> 07:46.860]  Last but not least, we also have what we call electromagnetic emission, EM.
[07:46.860 --> 07:53.100]  And EM is also a very powerful channel, widely used to recover secret keys.
[07:53.100 --> 08:01.680]  Together with current, it is probably one of the two most used channels in side-channel attacks, with timing also being important, as I've mentioned.
[08:01.800 --> 08:06.280]  But really, current and EM are probably the two leading channels these days.
[08:06.920 --> 08:08.540]  What does it look like in practice?
[08:08.540 --> 08:12.720]  So here is a power trace of an AES.
[08:12.720 --> 08:24.160]  And if you look carefully in the middle of the slide, you'll see, well, about 10 times the same pattern, and they do correspond to the 10 rounds of AES.
[08:24.300 --> 08:33.420]  So on an unprotected, or very lightly protected, AES, you can visually identify the rounds just by looking at the power trace.
[08:33.420 --> 08:42.040]  If we can observe it as humans, it means there is some statistical information there that can be exploited to understand what is happening.
[08:42.040 --> 08:49.600]  And that's, in a sense, the side channel, the side effect, that we can use to recover an AES key.
[08:49.960 --> 08:51.480]  In a nutshell, how do you go about it?
[08:51.480 --> 08:56.700]  Well, you get a CPU, the target, to perform the encryption.
[08:56.700 --> 09:03.440]  While the encryption is performed, you record the power trace using an oscilloscope.
[09:03.660 --> 09:14.740]  Then, in a traditional side-channel attack, the state-of-the-art technique is the template attack, where you combine the traces you observe and make statistical estimates of what the secret could be.
[09:14.740 --> 09:18.300]  And the statistical estimate will help you to recover your AES key.
[09:18.300 --> 09:20.220]  This is how it works.
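A template attack first profiles a device where the key is known, building a statistical model (a template) of the leakage for each class of intermediate value, and then scores every key guess on the target by likelihood. The sketch below is a deliberately tiny, univariate version on simulated Hamming-weight leakage; real template attacks typically use multivariate Gaussians over selected points of interest. All names and parameters are assumptions for illustration.

```python
import numpy as np


def hw(values) -> np.ndarray:
    """Hamming weight of each byte."""
    v = np.asarray(values, dtype=np.uint8)
    return np.unpackbits(v[..., None], axis=-1).sum(axis=-1)


def simulate(key: int, plaintexts: np.ndarray, noise_std: float, rng) -> np.ndarray:
    """Simulated single-sample leakage: HW(plaintext ^ key) + Gaussian noise."""
    return hw(plaintexts ^ np.uint8(key)) + rng.normal(0.0, noise_std, len(plaintexts))


def build_templates(traces, plaintexts, known_key):
    """Profiling phase: mean leakage per Hamming-weight class, plus a pooled noise std."""
    classes = hw(plaintexts ^ np.uint8(known_key))
    means = np.array([traces[classes == c].mean() for c in range(9)])
    sigma = (traces - means[classes]).std()
    return means, sigma


def template_attack(traces, plaintexts, means, sigma) -> int:
    """Attack phase: Gaussian log-likelihood of the traces under each of the 256 key guesses."""
    scores = np.empty(256)
    for guess in range(256):
        expected = means[hw(plaintexts ^ np.uint8(guess))]
        scores[guess] = -np.sum((traces - expected) ** 2) / (2 * sigma**2)
    return int(np.argmax(scores))


rng = np.random.default_rng(1)
# Profile on a device whose key we control (0x2B)...
profiling_pt = rng.integers(0, 256, 5000).astype(np.uint8)
means, sigma = build_templates(simulate(0x2B, profiling_pt, 0.5, rng), profiling_pt, 0x2B)

# ...then attack a device with an unknown key (0x7E).
attack_pt = rng.integers(0, 256, 200).astype(np.uint8)
print(hex(template_attack(simulate(0x7E, attack_pt, 0.5, rng), attack_pt, means, sigma)))
```

The templates are indexed by Hamming-weight class rather than by key, which is why a profile built with one key can be reused against a device holding another.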
[09:20.680 --> 09:25.160]  If you wonder what type of hardware we use, here it is.
[09:25.160 --> 09:27.660]  It doesn't mean it's the best; it's just what works for us.
[09:27.700 --> 09:30.500]  We use a NewAE ChipWhisperer-Pro.
[09:30.820 --> 09:40.380]  And for some of our work, although not this talk, we also use a PicoScope 6000 when we need faster sampling: when the target is very fast and we need a lot of information.
[09:40.380 --> 09:52.420]  For example, last year, when we talked about SCAAML, which is a way to do side-channel attacks using machine learning that we're going to briefly recap in a second, we used a PicoScope.
[09:52.420 --> 09:55.900]  For this work, we just use a plain old ChipWhisperer.
[09:56.120 --> 10:04.460]  And again, this is not an ad; this just happens to be what we use in practice, as we want to make this talk as concrete as we can.
[10:04.880 --> 10:13.620]  Now that we have an idea of what side-channel attacks are, let's talk about how you would use AI to perform such an attack.
[10:13.660 --> 10:20.280]  Well, this is something we call side-channel attacks assisted with machine learning, also known as SCAAML.
[10:20.280 --> 10:24.200]  This is what we presented in depth last year.
[10:24.200 --> 10:36.240]  But let me briefly recap how it works in practice, because we're going to need the model we create with a SCAAML attack to do the explainability and then find out what the leakage is.
[10:37.120 --> 10:43.200]  If you want to have more detail, by the way, about this type of attack, well, you can check out last year's talk.
[10:43.200 --> 10:49.320]  It's available on my website at elie.net/scaaml.
[10:49.320 --> 10:53.260]  I will probably also put a link on Twitter if you want to follow along.
[10:53.260 --> 10:55.220]  And again, you'll have way more detail.
[10:55.400 --> 11:00.440]  I'm going to try to shrink down the explanation as much as I can so you can follow along.
[11:00.440 --> 11:03.660]  But if you want all the detail, they are in the previous talk.
[11:03.920 --> 11:11.400]  I also want to say, for people who follow both, that this year is different from last year in two ways.
[11:11.400 --> 11:25.760]  Last year, when we talked about using machine learning to attack hardware encryption, we took the worst case: a black-box attack where the attacker doesn't have any knowledge about the target.
[11:25.940 --> 11:36.360]  And, for example, the attacker cannot access the clock, because the clock is usually not accessible once the chip, or the hardware implementation, is in production mode.
[11:36.960 --> 11:48.480]  As a result, we were collecting traces asynchronously, which means that the clock of the target and the clock of the oscilloscope were different, which is why we had to use a very high sampling rate.
[11:48.480 --> 11:57.000]  This year, we are changing the threat model, because this time the tool is for people who are developing implementations.
[11:57.100 --> 12:03.160]  So we do assume you have the code, you have the hardware target, you can put it in debug mode.
[12:03.160 --> 12:07.260]  And so we don't need to create asynchronous traces.
[12:07.260 --> 12:14.200]  We also have a good idea, if you're a developer, at what time AES starts and what time it ends.
[12:14.200 --> 12:17.060]  So you can also create shorter traces.
[12:17.420 --> 12:30.460]  The reason to do shorter captures is that machine learning has an easier time if you don't capture the whole thing, because it spends less training time learning to ignore the parts of the trace that are useless.
[12:30.460 --> 12:35.440]  So again, this is a white box attacker model. It makes more sense for that work.
[12:35.440 --> 12:45.240]  However, do not compare the models we use in this talk, which are easier and smaller, with the ones in the previous talk, where the machine learning had a much harder task.
[12:45.240 --> 12:52.600]  In a black-box setting, the machine learning works harder, which means you have to train more and use more complicated, deeper architectures.
[12:52.600 --> 13:01.520]  In the case of a white box, when you are laser-focused on one part of the implementation, here the first round, we don't need that.
[13:01.860 --> 13:08.680]  The way side-channel attacks assisted by machine learning work is very similar to traditional side-channel attacks.
[13:08.680 --> 13:15.340]  As in the previous case, you have the encryption which is running, and then you capture the trace.
[13:15.340 --> 13:26.820]  Don't forget, as you capture the traces, to normalize them between -1 and 1, because machine learning works well in that range, which is not what a traditional oscilloscope outputs.
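Oscilloscopes output raw voltages or ADC codes, so the traces need rescaling before training. Here is a minimal min-max scaling sketch to [-1, 1]. Whether to scale each trace on its own or with global statistics computed on the training set is a design choice the talk does not specify, so both are offered behind a hypothetical `per_trace` flag.

```python
import numpy as np


def scale_to_unit_range(traces, per_trace: bool = True) -> np.ndarray:
    """Min-max scale traces of shape (n_traces, n_samples) into [-1, 1]."""
    traces = np.asarray(traces, dtype=np.float64)
    if per_trace:
        lo = traces.min(axis=1, keepdims=True)
        hi = traces.max(axis=1, keepdims=True)
    else:
        lo, hi = traces.min(), traces.max()
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero on flat traces
    return 2.0 * (traces - lo) / span - 1.0


raw = np.array([[0.0, 5.0, 10.0], [2.0, 2.0, 4.0]])
print(scale_to_unit_range(raw))
print(scale_to_unit_range(raw, per_trace=False))
```

With global scaling, the minimum and maximum should be computed on the training traces and reused unchanged for validation and attack traces, so that all inputs share the same scale.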
[13:26.820 --> 13:39.360]  Then we feed those traces to a deep neural network, which we train to predict the values that can be used to recover the key.
[13:39.360 --> 13:46.540]  Then we combine those predictions into a statistical estimate and hopefully get back the AES key.
[13:46.920 --> 13:53.800]  One of the advantages of that, as illustrated last year, is that you do not need to do any kind of pre-processing.
[13:53.800 --> 14:01.280]  You can just feed the traces directly, and less expert knowledge is required, so it's possible to do it almost automatically.
[14:01.280 --> 14:12.120]  That makes it a little easier, and in a sense it's also more powerful than the traditional template attack, for the reasons mentioned.
[14:12.120 --> 14:18.280]  When you do a side-channel attack, you do not necessarily target the key directly.
[14:18.280 --> 14:30.160]  You target what we call an attack point. In the case of TinyAES, there are two attack points that work really well. The first is sub_bytes_in, the value obtained when you XOR the key with the plaintext.
[14:30.160 --> 14:35.540]  The second is sub_bytes_out, which is the output of the S-box.
[14:36.540 --> 14:44.640]  In this talk, we're going to focus on sub_bytes_in, which we know works really well based on our experience.
[14:44.640 --> 14:50.560]  So the machine learning will not predict the key directly. That really doesn't work when you try it.
[14:50.560 --> 15:03.640]  Instead, it predicts sub_bytes_in. If you want to do the real attack, you take the predicted sub_bytes_in and invert it using the plaintext, with another XOR, and you get the key byte.
[15:04.020 --> 15:06.660]  So the attack point today is sub_bytes_in.
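For the first AES round, both attack points are simple functions of one key byte and one plaintext byte: sub_bytes_in is the XOR performed by AddRoundKey, and sub_bytes_out is the AES S-box applied to it. The sketch below computes the standard AES S-box from its definition (multiplicative inverse in GF(2^8), then the affine transform from FIPS 197) and shows that undoing the XOR recovers the key byte. The function names are illustrative, not taken from the SCALD or SCAAML code.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def rotl8(x: int, n: int) -> int:
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox() -> list:
    sbox = []
    for a in range(256):
        # Multiplicative inverse (0 maps to 0), found by brute force for clarity.
        inv = 0 if a == 0 else next(x for x in range(1, 256) if gf_mul(a, x) == 1)
        # Affine transformation from the AES specification.
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_bytes_in(key_byte: int, plaintext_byte: int) -> int:
    """First-round S-box input: AddRoundKey is a plain XOR."""
    return key_byte ^ plaintext_byte


def sub_bytes_out(key_byte: int, plaintext_byte: int) -> int:
    """First-round S-box output."""
    return SBOX[sub_bytes_in(key_byte, plaintext_byte)]


def key_byte_from_sub_bytes_in(predicted: int, plaintext_byte: int) -> int:
    """Invert the attack point: XOR the predicted value with the known plaintext byte."""
    return predicted ^ plaintext_byte


predicted = sub_bytes_in(0x2B, 0x32)  # what a perfect model would output
print(hex(key_byte_from_sub_bytes_in(predicted, 0x32)))  # 0x2b
```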
[15:07.500 --> 15:14.260]  When the machine learning predicts sub_bytes_in, you give it a trace and it outputs a softmax.
[15:14.260 --> 15:24.660]  That is, a probabilistic output over the 256 possible values that tells you which value is most likely, but also what it thinks is the second-best value, the third-best value, and so forth.
[15:24.740 --> 15:26.760]  And that's what the softmax does.
[15:26.840 --> 15:36.740]  How you combine those outputs, and that's why we call it a probabilistic attack, is that you basically sum them up in log10 space to avoid floating-point errors.
[15:36.740 --> 15:46.280]  And by combining them, hopefully you get the most likely value, and the model is correct most of the time.
[15:46.280 --> 15:47.840]  So it will quickly converge.
[15:47.840 --> 15:53.340]  Last year, we showed that for TinyAES on a full trace, we only need four traces.
[15:53.340 --> 16:00.800]  And you will see that in today's specific setting, where it's even easier, you need as few as two.
[16:00.800 --> 16:08.120]  So basically three, four traces for simple cases, you get the correct value and you recover one byte of the key.
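The combination step can be sketched as follows, assuming each model outputs a softmax over the 256 possible sub_bytes_in values for one byte. Under a key guess k, trace i implies sub_bytes_in = k XOR plaintext_i, so we accumulate the log10 of the probability the model gave to that value across traces; multiplying raw probabilities instead would quickly underflow. The small epsilon and the function name are assumptions.

```python
import numpy as np


def rank_key_byte(sub_bytes_in_probs, plaintext_bytes, eps: float = 1e-30) -> np.ndarray:
    """Return the 256 key-byte guesses sorted from most to least likely.

    sub_bytes_in_probs: (n_traces, 256) softmax outputs of one per-byte model.
    plaintext_bytes:    (n_traces,) plaintext byte used for each trace.
    """
    probs = np.asarray(sub_bytes_in_probs, dtype=np.float64)
    guesses = np.arange(256)
    scores = np.zeros(256)
    for trace_probs, pt in zip(probs, np.asarray(plaintext_bytes, dtype=np.int64)):
        # For every key guess k, look up the probability assigned to k ^ pt.
        scores += np.log10(trace_probs[guesses ^ pt] + eps)
    return np.argsort(scores)[::-1]


# Toy example: a model that puts 50% on the correct sub_bytes_in value each time.
true_key, plaintexts = 0x2B, [0x00, 0x10, 0xA5]
probs = np.full((3, 256), 0.5 / 255)
for row, pt in enumerate(plaintexts):
    probs[row, true_key ^ pt] = 0.5
print(hex(rank_key_byte(probs, plaintexts)[0]))  # 0x2b
```

The position of the true key byte in this ranking after n traces (its key rank) is a convenient way to measure how many traces the attack needs.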
[16:08.760 --> 16:12.720]  What's important to mention as well here is we have one model per byte.
[16:12.720 --> 16:15.580]  AES works on 16 bytes.
[16:15.580 --> 16:19.620]  So in reality we have 16 models, each predicting one byte.
[16:19.620 --> 16:22.200]  It is easier for the machine learning to predict one byte at a time.
[16:22.200 --> 16:24.620]  So, well, you have to train 16 times.
[16:24.980 --> 16:28.660]  I'm not going to show that on the slide because it's not relevant.
[16:28.660 --> 16:30.900]  If you can wait for one, you can wait for 16.
[16:30.940 --> 16:36.280]  However, there are some differences in accuracy between the bytes, but that's not relevant for this talk.
[16:36.520 --> 16:37.200]  Okay.
[16:37.760 --> 16:48.060]  For the architecture, this year we use a hypertuned residual 1D convolutional neural network.
[16:48.140 --> 16:50.280]  This model is different from last year's.
[16:50.280 --> 16:52.800]  It is way more efficient.
[16:52.800 --> 16:58.620]  It's smaller: I think it's around 300,000 parameters.
[16:59.340 --> 17:02.300]  It's well tuned and it works really well out of the box.
[17:02.300 --> 17:08.640]  This is kind of our go-to model these days, based on our previous work and on testing a ton of models.
[17:09.320 --> 17:17.140]  It's kind of funny, but the paper with all our tests of the machine learning models will be out at some point.
[17:17.140 --> 17:19.400]  I said last year it would be soon.
[17:19.400 --> 17:24.680]  We had some technical difficulties making everything reproducible, and we have made a lot of improvements.
[17:24.680 --> 17:38.060]  But I'm really hopeful, knocking on wood, that these models and the data we used to gain our expertise will be out in the near future, so you can test them out.
[17:38.060 --> 17:47.840]  For SCALD, you need all 16 models because you need to know what the bytes have in common to know exactly what the main source of leakage is.
[17:38.060 --> 17:47.840]  For Scaled, you need the 16 models because you need to know what is the commonality between all the bytes to know exactly what is the main source of leakage.
[17:47.840 --> 17:53.200]  So you train 16 models. As I mentioned, the accuracy varies.
[17:53.320 --> 17:56.860]  To be clear, these are validation accuracies.
[17:57.200 --> 18:09.800]  So on data not seen during training, you reach something between 63% for the worst one, which I think is byte 4, up to 87% for the best one, which is byte 0.
[18:10.040 --> 18:12.080]  The other bytes fall in between.
[18:12.780 --> 18:14.860]  For SCALD, it doesn't really matter.
[18:14.860 --> 18:23.040]  What you need is to be able to isolate enough examples where the model's prediction is correct, because those are the ones we're going to use for explainability.
[18:23.220 --> 18:27.880]  We want to know what the model is using when it's correct.
[18:28.020 --> 18:34.600]  That's why we don't train longer than that. Each model takes about, I would say, 15 to 20 minutes to train.
[18:34.680 --> 18:38.080]  So you do 3 per hour and you have 16 to go.
[18:38.080 --> 18:42.260]  So, you know, that's already about 5 hours of training time.
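The back-of-the-envelope training budget, using the 15 to 20 minutes per model quoted above:

$$16 \times 15\ \text{min} = 240\ \text{min} = 4\ \text{h}, \qquad 16 \times 20\ \text{min} = 320\ \text{min} \approx 5.3\ \text{h}$$

At roughly 3 models per hour, $16 / 3 \approx 5.3$ hours, hence "about 5 hours".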
[18:42.260 --> 18:46.280]  So we don't want to do 20 epochs, which would be completely overkill.
[18:46.600 --> 18:59.260]  And, unsurprisingly, because TinyAES is not protected against side-channel attacks and our model is well optimized, we reach high accuracy in 5 epochs for all of them.
[18:59.580 --> 19:03.200]  Now, the model is good at extracting the key.
[19:03.200 --> 19:07.220]  So we are able to use it and we can consistently recover keys.
[19:07.220 --> 19:13.480]  Now the question is, OK, how does it help us to go back and, well, find where the leak is coming from?
[19:13.480 --> 19:18.380]  Well, this is where you need to add another piece, which is deep learning explainability.
[19:18.680 --> 19:20.680]  So what is deep learning explainability?
[19:20.920 --> 19:25.460]  Deep learning explainability was first developed, I believe, for vision.
[19:25.460 --> 19:28.440]  And the idea was, I have a deep learning network.
[19:28.480 --> 19:33.480]  And it says, in this picture, we have a boxer and we have a tiger cat.
[19:33.580 --> 19:36.320]  Now you can ask the question, OK, but why?
[19:36.320 --> 19:39.560]  Does it really look at the cat? Does it really look at the dog?
[19:39.560 --> 19:42.360]  Or does it look at statistics? I don't know.
[19:42.360 --> 19:51.060]  Maybe the stripe colors of the tail of the cat or maybe the dog has a leash or something like that.
[19:51.060 --> 19:55.640]  So what you want to do is you want to have a way to ask the neural network, what do you look at?
[19:55.640 --> 19:58.120]  And that's what explainability is about.
[19:58.120 --> 20:02.980]  It's being able to say, for a model, how does it come up with a given prediction?
[20:02.980 --> 20:06.740]  Why, for this specific class, that is, one output neuron,
[20:07.060 --> 20:10.360]  what did you use as input, and how much does each part of the input matter to you?
[20:10.520 --> 20:14.260]  So it's basically almost inverting the machine learning model, if you will.
[20:15.040 --> 20:17.420]  And so we call that explainability.
[20:17.500 --> 20:20.480]  There are many techniques, but the idea is...
[20:20.480 --> 20:25.380]  What all the techniques have in common is that you feed the model to the explainer.
[20:25.680 --> 20:28.840]  Then you feed the input you would like information about.
[20:28.840 --> 20:33.420]  And then you have to tell it which class you want an explanation for.
[20:33.420 --> 20:39.880]  This is why, as I explained earlier, SCALD needs models that work well.
[20:39.880 --> 20:46.100]  Not necessarily 99%, but at least quite well, because we need examples of successful predictions,
[20:46.100 --> 20:52.060]  since we want to know, for a given trace and a successful prediction, what the model looked at.
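The talk relies on gradient- and activation-based explainers (Grad-CAM variants, activation maps). As a simpler, model-agnostic stand-in, here is a sketch of occlusion sensitivity: slide a blanked window across the trace and record how much the probability of the correctly predicted class drops. `predict_proba` is any function mapping a batch of traces to class probabilities; the toy model below is hypothetical and only looks at sample 40.

```python
import numpy as np


def occlusion_map(predict_proba, trace, target_class: int, window: int = 8, fill: float = 0.0) -> np.ndarray:
    """Average drop in P(target_class) when each sample is covered by a blanked window."""
    trace = np.asarray(trace, dtype=np.float64)
    baseline = predict_proba(trace[None, :])[0, target_class]
    importance = np.zeros_like(trace)
    counts = np.zeros_like(trace)
    for start in range(len(trace) - window + 1):
        occluded = trace.copy()
        occluded[start:start + window] = fill
        drop = baseline - predict_proba(occluded[None, :])[0, target_class]
        importance[start:start + window] += drop
        counts[start:start + window] += 1
    return importance / np.maximum(counts, 1)


def toy_predict_proba(batch: np.ndarray) -> np.ndarray:
    """Hypothetical 2-class model whose output depends only on sample 40."""
    p1 = 1.0 / (1.0 + np.exp(-5.0 * batch[:, 40]))
    return np.stack([1.0 - p1, p1], axis=1)


trace = np.full(100, 0.2)
trace[40] = 1.0
explanation = occlusion_map(toy_predict_proba, trace, target_class=1)
print(int(np.argmax(explanation)))  # 40
```

Occlusion needs one forward pass per window position, so it is slower than gradient-based methods on long traces, but it makes no assumption about the model's internals.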
[20:52.060 --> 20:55.440]  So, as I mentioned, you give it to the explainer.
[20:55.440 --> 20:59.960]  Then you give it the picture we had.
[20:59.960 --> 21:04.020]  And we say, OK, why did you believe it was a boxer?
[21:04.020 --> 21:11.440]  And hopefully, it will tell you, well, I look at the face of the puppy and it's a boxer.
[21:11.480 --> 21:13.240]  OK, that's reasonable.
[21:13.640 --> 21:16.820]  And you can ask also, OK, how about the cat?
[21:17.600 --> 21:22.840]  And hopefully, and again, this is a real example out of one of the explanation techniques,
[21:22.840 --> 21:25.020]  it will tell you, well, I look at the cat.
[21:25.020 --> 21:29.800]  And the reason why I think it's a cat is because, well, there is a face of a cat,
[21:29.800 --> 21:31.660]  but the most important thing is there are stripes.
[21:32.200 --> 21:35.040]  As you can see, it's the right part of the image.
[21:35.700 --> 21:37.360]  And you're like, OK, that makes sense.
[21:37.380 --> 21:41.240]  I guess a cat which has stripes is probably a tiger cat.
[21:41.240 --> 21:41.980]  Makes sense.
[21:41.980 --> 21:45.420]  The machine learning is actually looking at what it should.
[21:45.680 --> 21:47.980]  Was this technique ever useful?
[21:47.980 --> 21:49.920]  And the answer is yes, absolutely.
[21:49.920 --> 21:54.820]  There is this very, very famous data set in machine learning for vision.
[21:54.820 --> 21:59.180]  It's called PASCAL VOC, the PASCAL Visual Object Classes dataset.
[21:59.180 --> 22:04.140]  And what happened was, in the early version of this data set,
[22:04.140 --> 22:13.400]  I think about one in five horse pictures had the name of the photographer in the bottom-left corner.
[22:13.420 --> 22:16.940]  And what the machine learning learned was not to recognize the horse,
[22:16.940 --> 22:21.720]  it was to recognize that there was something in the bottom-left corner.
[22:21.760 --> 22:24.880]  All right, so that's what explainability is for, so that's what we would like,
[22:24.880 --> 22:27.960]  because essentially what we want to ask the machine learning is,
[22:27.960 --> 22:32.140]  given an input and a prediction,
[22:32.140 --> 22:35.260]  can you tell me how you figure out what is leaking, right?
[22:35.360 --> 22:39.940]  Where do you get information to get to the conclusion that it's the correct key,
[22:39.940 --> 22:41.520]  or the correct attack point?
[22:42.560 --> 22:45.160]  Well, that seems great.
[22:45.160 --> 22:47.740]  It seems great in theory, and we can see where it's going.
[22:47.740 --> 22:52.020]  The thing is, that doesn't tell us how we're going to combine these explainability techniques
[22:52.020 --> 22:54.460]  and dynamic analysis to debug leakage, right?
[22:54.460 --> 23:01.380]  Because so far, all I've explained is how we might get some part of the input highlighted,
[23:01.380 --> 23:04.680]  but that doesn't tell you how to go back and find where the leakage comes from in the code.
[23:04.960 --> 23:07.680]  And that's the whole difficulty of SCALD,
[23:07.680 --> 23:11.800]  and that's why it took us over a year
[23:11.800 --> 23:14.580]  to really figure out how to get it done.
[23:14.580 --> 23:19.220]  You need to be very creative about how to combine those things,
[23:19.220 --> 23:22.760]  which in theory should fit together, to actually get the result you want.
[23:22.760 --> 23:25.400]  So let's deep dive into how you get there.
[23:25.600 --> 23:27.680]  Our game plan was fairly straightforward.
[23:27.680 --> 23:31.700]  Again, we start with an explainer, we're going to give it our trained models,
[23:32.240 --> 23:35.160]  and then we're going to give it the trace and the prediction,
[23:35.160 --> 23:39.500]  and say, okay, please tell me which points of the trace
[23:39.500 --> 23:42.220]  were important for you to make your prediction.
[23:42.220 --> 23:43.860]  We're going to call that the leakage map.
[23:43.860 --> 23:51.860]  Then we're also going to run, not necessarily after but in parallel,
[23:52.840 --> 23:58.000]  a target emulator, which basically runs our target,
[23:58.000 --> 24:05.300]  the given CPU, the STM32F4 in our case, with the firmware.
[24:05.300 --> 24:08.060]  In our case, the firmware we chose is TinyAES,
[24:08.060 --> 24:12.420]  and we're going to emulate it to know
[24:12.420 --> 24:20.240]  which opcode each instruction cycle of the ARM CPU corresponds to.
[24:20.240 --> 24:24.560]  So basically we need to know, at each precise point in time,
[24:24.700 --> 24:30.360]  which instruction of the code was run.
[24:30.360 --> 24:34.160]  With that, we can combine both of them.
[24:34.240 --> 24:42.400]  We can combine the leakage map, which tells us when the leakage happens, with the emulation, which tells us
[24:42.400 --> 24:46.640]  at what time each instruction was run, to annotate the code
[24:46.640 --> 24:49.000]  with where the leakages are. That's the idea.
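Here is a minimal sketch of that combination step, under two assumptions the talk does not spell out: the emulator produces an ordered log of executed instructions with their cycle counts, and the capture is synchronous with a known, fixed number of samples per clock cycle. Each leakage-map sample is attributed to the instruction executing at that time, and scores are summed per instruction address so that loop iterations of the same line of code accumulate. The data class, the addresses, and the mnemonics in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

import numpy as np


@dataclass
class ExecutedInstruction:
    address: int   # program counter of the instruction
    mnemonic: str  # disassembly, for display
    cycles: int    # clock cycles the emulator says it took


def sample_owners(log: Sequence[ExecutedInstruction], n_samples: int, samples_per_cycle: int) -> np.ndarray:
    """Index into `log` of the instruction executing at each sample (-1 if none)."""
    owners = np.full(n_samples, -1, dtype=np.int64)
    start = 0
    for idx, ins in enumerate(log):
        end = start + ins.cycles * samples_per_cycle
        owners[start:end] = idx
        start = end
        if start >= n_samples:
            break
    return owners


def annotate(log: Sequence[ExecutedInstruction], leakage_map, samples_per_cycle: int) -> List[Tuple[int, float]]:
    """(address, total leakage) pairs, most leaky first."""
    leakage_map = np.asarray(leakage_map, dtype=np.float64)
    owners = sample_owners(log, len(leakage_map), samples_per_cycle)
    totals: Dict[int, float] = {}
    for idx, ins in enumerate(log):
        mask = owners == idx
        if mask.any():
            totals[ins.address] = totals.get(ins.address, 0.0) + float(leakage_map[mask].sum())
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


log = [
    ExecutedInstruction(0x08000100, "ldrb r3, [r1]", 2),
    ExecutedInstruction(0x08000102, "eors r3, r2", 1),
    ExecutedInstruction(0x08000104, "strb r3, [r1]", 2),
]
leak = np.zeros(10)
leak[4:6] = 1.0  # the leakage map lights up while the XOR executes
print([(hex(addr), score) for addr, score in annotate(log, leak, samples_per_cycle=2)])
```

Aggregating by address rather than by position in the log is what turns a time-domain map into a per-line annotation of the source.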
[24:50.120 --> 24:53.560]  In practice, this is where it becomes complicated.
[24:53.740 --> 24:55.880]  There is, of course, a lot of techniques.
[24:56.180 --> 25:00.900]  Here is a figure from one of the recent papers,
[25:00.900 --> 25:04.020]  called Sanity Checks for Saliency Maps,
[25:04.020 --> 25:08.820]  which looks at how well different types of explainability techniques work.
[25:08.820 --> 25:13.440]  You can see some of them produce more defined outputs than others.
[25:13.800 --> 25:17.220]  In our research, we tried a bunch of them,
[25:17.220 --> 25:21.980]  including Guided Grad-CAM, which seemed, when we started,
[25:21.980 --> 25:25.040]  to give the most defined and precise leakage.
[25:25.720 --> 25:28.100]  And then we tested a bunch of them.
[25:28.200 --> 25:31.420]  The idea was, which explainability technique would work best for us,
[25:31.420 --> 25:37.980]  because we wanted to highlight very precisely which part of the trace was the most important one.
[25:37.980 --> 25:40.340]  How do you do that? Well, you get the explanations:
[25:40.340 --> 25:45.040]  you run a lot of successfully predicted traces through the explainer.
[25:45.040 --> 25:47.120]  Here, we use the activation map technique.
[25:47.120 --> 25:54.020]  Then you combine them, and you normalize between 0 and 1 to create some sort of a mask.
[25:54.020 --> 25:57.760]  And then you eliminate all the noise,
[25:57.760 --> 26:04.240]  and hopefully you get the bottom image, which is this leakage map.
[26:04.240 --> 26:08.500]  You can see it's striped here, and the lighter places
[26:08.500 --> 26:12.740]  are supposed to be where the leakage the model relies on is strongest.
[26:12.740 --> 26:17.660]  So according to activation maps, for byte 0 I think,
[26:17.660 --> 26:23.980]  there is a leakage very, very early, and very, very late, around point 4,000,
[26:23.980 --> 26:27.920]  so at the end of our traces, some more leakage.
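The aggregation described above (combine the per-trace explanations, normalize to [0, 1], eliminate the noise) can be sketched like this. Taking the absolute value, averaging over traces, and keeping only values above a quantile are our assumptions about reasonable choices, not necessarily what SCALD does.

```python
import numpy as np


def build_leakage_map(explanations, keep_quantile: float = 0.9) -> np.ndarray:
    """Turn per-trace attributions of shape (n_traces, n_samples) into one leakage map in [0, 1]."""
    aggregate = np.abs(np.asarray(explanations, dtype=np.float64)).mean(axis=0)
    lo, hi = aggregate.min(), aggregate.max()
    if hi == lo:
        return np.zeros_like(aggregate)
    normalized = (aggregate - lo) / (hi - lo)
    # Treat everything below the chosen quantile as noise.
    threshold = np.quantile(normalized, keep_quantile)
    return np.where(normalized >= threshold, normalized, 0.0)


explanations = np.zeros((3, 20))
explanations[:, 7] = 5.0
explanations[:, 12] = 2.0
print(np.nonzero(build_leakage_map(explanations))[0])
```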
[26:28.380 --> 26:31.960]  Depending on the technique you use, you're going to get different results.
[26:31.960 --> 26:35.700]  The first one we tested is SNR, the signal-to-noise ratio, which is not a deep learning technique,
[26:35.700 --> 26:41.580]  but is the standard technique used in side-channel attacks to detect whether or not there is a leak.
[26:41.660 --> 26:47.240]  So it's a very robust, well-proven statistical method, kind of our baseline.
[26:47.240 --> 26:54.640]  So the signal-to-noise ratio tells you that, as you can see, there is a main leak,
[26:54.640 --> 26:57.240]  and there is a secondary leak somewhere in the middle of the trace.
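For reference, the SNR baseline is commonly computed per trace sample as the variance of the class means divided by the mean of the within-class variances, where traces are grouped by the value being attacked (for example, the targeted intermediate byte). A minimal pure-Python sketch, assuming traces are equal-length lists of floats; real tooling would typically use vectorized arrays instead:

```python
from collections import defaultdict
from statistics import mean, pvariance
from typing import Dict, List, Sequence


def snr(traces: Sequence[Sequence[float]], labels: Sequence[int]) -> List[float]:
    """Per-sample signal-to-noise ratio of traces grouped by label.

    signal = variance (across label classes) of the class means
    noise  = mean (across label classes) of the within-class variances
    """
    groups: Dict[int, List[Sequence[float]]] = defaultdict(list)
    for trace, label in zip(traces, labels):
        groups[label].append(trace)

    result = []
    for t in range(len(traces[0])):
        samples = [[tr[t] for tr in group] for group in groups.values()]
        signal = pvariance([mean(s) for s in samples])
        noise = mean(pvariance(s) for s in samples)
        if noise > 0:
            result.append(signal / noise)
        else:
            # Zero within-class noise: infinite SNR if the classes differ at all.
            result.append(float("inf") if signal > 0 else 0.0)
    return result
```

A peak in the returned list marks a sample whose value depends strongly on the label, which is what the "main leak" and "secondary leak" in the SNR plot correspond to.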
[26:57.460 --> 27:01.900]  And then if we look at Grad-CAM++, which is one of the latest techniques,
[27:02.760 --> 27:05.140]  the results are not so clear.
[27:05.440 --> 27:12.460]  We have a bunch of different points, which don't seem to align very well with the SNR,
[27:12.460 --> 27:13.580]  or at least they are less defined.
[27:13.580 --> 27:16.160]  There are zigzags in places, but not exactly the same.
[27:16.520 --> 27:22.620]  And then the activation map looks almost equivalent to Grad-CAM++.
[27:24.220 --> 27:29.740]  Even though the activation map looks at the output of a layer, the lowest layer,
[27:30.100 --> 27:31.940]  it looks very, very much the same.
[27:31.960 --> 27:34.320]  So how do you benchmark how good these explanations are?
[27:34.320 --> 27:37.200]  So the idea is, well, we have a leak map, right?
[27:37.200 --> 27:42.620]  So what we can do is take our test traces that we know the model predicts successfully,
[27:42.620 --> 27:50.320]  and then use the leak map to, let's say, remove the four points, or the eight points,
[27:50.320 --> 27:53.560]  that are supposed to be the most important according to the leakage map, out of the trace.
[27:53.560 --> 27:58.300]  We can just literally blank them. Blanking them means setting them to zero, or to minus one.
[27:58.300 --> 28:00.600]  Zero or minus one, same idea.
[28:01.400 --> 28:07.120]  You basically remove the information there, and hopefully, if they are the most important part of the prediction,
[28:07.120 --> 28:14.400]  then feeding the blanked traces back into the model should, in aggregate, decrease its accuracy, right?
[28:14.400 --> 28:19.700]  The idea is that if you blank out the points that the model uses to make the prediction,
[28:19.700 --> 28:25.500]  the accuracy should decrease. So mechanically, the best technique should yield the biggest decrease.
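The blanking benchmark can be expressed directly: take traces the model already predicts correctly, overwrite the k points the leakage map ranks highest, and re-measure accuracy. A sketch assuming a generic `predict` callable (a hypothetical stand-in for the trained model) that maps a trace to a class label:

```python
from typing import Callable, List, Sequence


def blank_top_k(trace: Sequence[float], leak_map: Sequence[float],
                k: int, fill: float = 0.0) -> List[float]:
    """Return a copy of `trace` with its k highest-ranked points set to `fill`."""
    ranked = sorted(range(len(leak_map)), key=lambda i: leak_map[i], reverse=True)
    blanked = list(trace)
    for i in ranked[:k]:
        blanked[i] = fill
    return blanked


def accuracy_after_blanking(predict: Callable[[List[float]], int],
                            traces: Sequence[Sequence[float]],
                            labels: Sequence[int],
                            leak_map: Sequence[float],
                            k: int, fill: float = 0.0) -> float:
    """Accuracy of `predict` on traces whose top-k leaking points were blanked."""
    correct = sum(predict(blank_top_k(tr, leak_map, k, fill)) == y
                  for tr, y in zip(traces, labels))
    return correct / len(traces)
```

Comparing explanation methods at the same k keeps the comparison fair: a larger accuracy drop means the map pointed at samples the model actually relies on.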
[28:25.780 --> 28:31.240]  The baseline, as I said, is 100%, because we only use correctly predicted traces. You remove four points.
[28:31.240 --> 28:39.080]  Why four points? Because of how the instructions look when we capture them with the oscilloscope, right?
[28:39.080 --> 28:42.700]  Each CPU cycle is supposed to be four points.
[28:42.700 --> 28:47.800]  So knowing that, we say, okay, let's try to remove the most important cycle.
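Removing "the most important cycle" rather than the four most important individual points means scoring the leakage map in cycle-sized chunks. A sketch assuming, as stated in the talk, four samples per cycle, and additionally assuming that sample 0 sits on a cycle boundary; `top_cycles` and `cycle_points` are hypothetical helper names:

```python
from typing import List, Sequence

POINTS_PER_CYCLE = 4  # from the talk: one CPU cycle spans about four samples


def top_cycles(leak_map: Sequence[float], n: int = 1,
               points_per_cycle: int = POINTS_PER_CYCLE) -> List[int]:
    """Indices of the n cycles whose summed leakage score is highest.

    Assumes sample 0 is aligned with the start of a cycle.
    """
    n_cycles = len(leak_map) // points_per_cycle
    scores = [sum(leak_map[c * points_per_cycle:(c + 1) * points_per_cycle])
              for c in range(n_cycles)]
    return sorted(range(n_cycles), key=lambda c: scores[c], reverse=True)[:n]


def cycle_points(cycle: int, points_per_cycle: int = POINTS_PER_CYCLE) -> List[int]:
    """Sample indices covered by one cycle, i.e. the points to blank."""
    return list(range(cycle * points_per_cycle, (cycle + 1) * points_per_cycle))
```

Blanking `cycle_points(c)` for `c` in `top_cycles(leak_map)` then removes one whole cycle's worth of signal at a time.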
[28:48.860 --> 28:54.800]  If we do it with SNR, the technique seems to work, right? It seems to reduce accuracy by 57% and 44%.
[28:55.560 --> 29:01.920]  If we do it with the activation map, and this was our first very big disappointment in the project,
[29:01.920 --> 29:07.260]  not last year, but a few months back, we were like, oh, well, it doesn't work that well.
[29:07.260 --> 29:11.040]  SNR is better. So the activation map doesn't work really well.
[29:11.040 --> 29:17.740]  And then, if you remember, I showed you that the leakage maps are almost the same, and the results are the same for Grad-CAM++.
[29:19.040 --> 29:23.820]  So the idea seems to work, in the sense that SNR works.
[29:24.500 --> 29:29.000]  But our deep learning explainability, which is way more complicated and should be way better, doesn't.
[29:29.060 --> 29:35.240]  So that was a little bit disappointing, and it was back to the drawing board.
[29:35.240 --> 29:37.860]  What can we do? Let's go back to that.
[29:37.860 --> 29:43.140]  And so the way we went about it was, OK, let's write our own technique,
[29:43.140 --> 29:50.020]  because what we want is only to find the top points, so we can probably do something with a very old idea,
[29:50.020 --> 29:52.860]  which is occlusion, and try to make it a little bit better.
[29:53.120 --> 29:58.900]  So I know it seems weird to say, OK, let's invent a new technique for a specific task.
[29:58.900 --> 30:03.140]  But the thing is, we are trying to do something very, very unique, very, very precise.
[30:03.140 --> 30:06.960]  We don't want a rough region and a more holistic understanding.
[30:06.960 --> 30:10.140]  We want the top 5 or top 20 points.
[30:10.140 --> 30:13.800]  So because our optimization objective is different,
[30:13.800 --> 30:18.160]  and the type of underlying algorithm we want to use is a little bit different,
[30:18.160 --> 30:22.620]  we can use occlusion. We literally slide a window over the trace to do that.
[30:22.620 --> 30:28.360]  And so SCALD, our explanation technique, is exactly that.
[30:28.360 --> 30:35.100]  It's a hybrid version of occlusion where we start by eliminating large regions
[30:35.100 --> 30:39.380]  to narrow down the candidate regions, and then, with convolutions,
[30:39.380 --> 30:42.140]  convolutional occlusion, which is something we developed for this,
[30:42.140 --> 30:48.920]  we really pinpoint, within the regions that we think are predictive,
[30:48.920 --> 30:51.760]  which exact points are the most important.
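The talk describes SCALD only at a high level (coarse occlusion of large regions, then convolutional occlusion inside the surviving regions), so the following is a sketch of that general two-stage idea, not SCALD itself. It assumes a hypothetical `score` callable returning the model's confidence in the correct class; the window sizes and the number of kept regions are illustrative parameters.

```python
from typing import Callable, List, Sequence, Tuple

Score = Callable[[List[float]], float]  # model confidence in the true class


def occlude(trace: Sequence[float], start: int, end: int,
            fill: float = 0.0) -> List[float]:
    """Copy of `trace` with samples [start, end) replaced by `fill`."""
    out = list(trace)
    for i in range(start, min(end, len(out))):
        out[i] = fill
    return out


def coarse_then_fine(trace: Sequence[float], score: Score, coarse: int = 100,
                     fine: int = 4, keep: int = 2, fill: float = 0.0) -> List[float]:
    """Per-sample importance from a two-stage occlusion search."""
    base = score(list(trace))
    n = len(trace)

    # Stage 1: occlude large non-overlapping windows, rank them by score drop.
    drops: List[Tuple[float, int]] = []
    for start in range(0, n, coarse):
        drops.append((base - score(occlude(trace, start, start + coarse, fill)), start))
    drops.sort(reverse=True)

    # Stage 2: inside each kept region, slide a small window with stride 1
    # (a convolution-like pass) and spread each drop over the window's samples.
    importance = [0.0] * n
    for _, start in drops[:keep]:
        end = min(start + coarse, n)
        for s in range(start, end - fine + 1):
            drop = base - score(occlude(trace, s, s + fine, fill))
            for i in range(s, s + fine):
                importance[i] += drop / fine
    return importance
```

The design point matches the one made in the talk: the coarse pass keeps the number of model calls low, while the stride-1 fine pass is what brings the precision down to individual samples.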
[30:51.980 --> 30:58.080]  And as you can see, the maps are way cleaner than the ones we had before,
[30:58.680 --> 31:02.420]  and they clearly outline the leakage.
[31:02.420 --> 31:05.620]  Also, as you can see, if we take byte 0 and byte 7,
[31:05.620 --> 31:11.220]  it is clear that byte 0 leaks before byte 7,
[31:11.220 --> 31:16.200]  which makes a ton of sense because TinyAES processes the bytes one after the other.
[31:16.200 --> 31:20.420]  It's not vectorized code. So that's exactly what we expect.
[31:21.420 --> 31:26.820]  Okay, so we're like, okay, SCALD should work, right?
[31:26.820 --> 31:30.620]  So we go back to the benchmark, run the thing, and voilà!
[31:30.620 --> 31:35.840]  The numbers are way better, or at least better for byte 7.
[31:35.940 --> 31:40.480]  And so now we have a technique that works better.
[31:40.540 --> 31:44.280]  Our map is more precise, so we hope the leakage pinpointing will be more precise.
[31:45.220 --> 31:49.160]  And just to give you a visual comparison, to finish on that point,
[31:49.160 --> 31:54.320]  it's very clear that, A, SNR and SCALD find roughly the same region,
[31:54.320 --> 31:58.320]  except the region is better defined with SCALD, which is good,
[31:58.320 --> 32:03.160]  because SNR is generally not wrong, and B, we have something cleaner,
[32:03.160 --> 32:07.880]  which means that we have improved over the existing state of the art
[32:07.880 --> 32:11.360]  and also finding something completely out of the Teracle model.
[32:11.360 --> 32:15.580]  So if the Teracle model is more precise, it's good.
[32:16.220 --> 32:21.500]  And so, as I said, the benchmark shows that in every case,
[32:22.060 --> 32:24.600]  SCALD outperformed everything we tested.
[32:24.680 --> 32:28.300]  That's not to say there isn't a better way, but this works for us,
[32:28.300 --> 32:30.820]  and it's good enough, as we will see.
[32:30.920 --> 32:34.540]  We also came to favor it because it's quite fast.
[32:34.740 --> 32:39.080]  It takes roughly 10 minutes to explain a network.
[32:39.160 --> 32:45.840]  Okay, now that we've explained in greater detail how you go from model to leakage map,
[32:45.840 --> 32:47.900]  the question is how you go from leakage map to code.
[32:48.160 --> 32:54.820]  Well, for that, as I mentioned, we have an emulator based on Unicorn and Rainbow.
[32:54.820 --> 32:58.980]  And basically, we run the firmware on the emulated CPU.
[32:59.280 --> 33:06.580]  And we have a state automaton to emulate what's happening over the span of the leakage map.
[33:06.580 --> 33:10.660]  The idea is you want to do AES start, run the AES, AES stop,
[33:10.660 --> 33:15.120]  and pick up what is in sync with the leakage map.
[33:15.320 --> 33:19.600]  And then what you get is mapped ASM.
[33:19.600 --> 33:24.380]  So what you get is, okay, this cycle maps to this time,
[33:24.380 --> 33:31.640]  this point in the trace maps to that cycle, and that cycle maps to this specific CPU instruction.
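The point-to-cycle-to-instruction step can be illustrated with a toy execution log. This assumes the emulator has already produced, for each executed instruction, the cycle at which it started; the `Executed` record, the four-samples-per-cycle ratio, and the `trigger_offset` alignment parameter are assumptions for this sketch, not Rainbow's actual output format.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Executed(NamedTuple):
    start_cycle: int  # cycle at which the instruction began (from the emulator)
    address: int      # program counter of the instruction
    mnemonic: str


def point_to_instruction(point: int, log: List[Executed],
                         points_per_cycle: int = 4,
                         trigger_offset: int = 0) -> Executed:
    """Map a trace sample index to the instruction executing at that time.

    `log` must be sorted by start_cycle; `trigger_offset` is the sample index
    at which the AES start trigger fired (cycle 0 of the log).
    """
    cycle = (point - trigger_offset) // points_per_cycle
    starts = [e.start_cycle for e in log]
    # Last instruction that started at or before `cycle`.
    idx = bisect_right(starts, cycle) - 1
    if idx < 0:
        raise ValueError(f"sample {point} precedes the first logged instruction")
    return log[idx]
```

With a log of `ldrb` at cycle 0, `eors` at cycle 2 and `strb` at cycle 3, sample 9 falls in cycle 2 and maps to `eors`.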
[33:32.020 --> 33:37.020]  And then, once we have a CPU instruction, we build a tree.
[33:37.020 --> 33:45.820]  In the tree, we bubble up the instruction to a given line of code using the debug symbols of our firmware.
[33:45.820 --> 33:51.100]  The firmware we run is built in debug mode.
[33:51.100 --> 33:55.160]  Again, we're back to this idea that this is a tool for developers.
[33:55.160 --> 34:00.540]  So when you test for leakage, you can compile with debug symbols because you're not trying to harden it.
[34:00.540 --> 34:01.840]  You just want to debug it.
[34:01.840 --> 34:07.160]  So we have the debug symbols, which help us go from the instruction back to the line of code.
[34:07.160 --> 34:11.100]  And hopefully, that gives us an idea of where the leak is in the code.
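The "bubble up" step amounts to grouping leakage scores by the source location that debug information assigns to each instruction address. A sketch assuming a precomputed address-to-location table (for example, one built by running binutils `addr2line -f -e firmware.elf` on the addresses of interest); the table contents and the name `build_leak_tree` are hypothetical, and the file and line values in the test are only illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

SourceLoc = Tuple[str, str, int]  # (file, function, line) from debug info


def build_leak_tree(hits: Iterable[Tuple[int, float]],
                    addr_to_loc: Dict[int, SourceLoc],
                    min_score: float = 0.0) -> Dict[str, Dict[int, float]]:
    """Aggregate per-instruction leakage into a function -> line -> score tree.

    hits: (instruction address, leakage score) pairs from the mapping step.
    Lines whose total score is not above `min_score` are filtered out.
    """
    tree: Dict[str, Dict[int, float]] = defaultdict(lambda: defaultdict(float))
    for address, score in hits:
        file, func, line = addr_to_loc[address]
        tree[f"{file}:{func}"][line] += score  # bubble up: instruction -> line

    filtered = {}
    for node, lines in tree.items():
        kept = {ln: s for ln, s in sorted(lines.items()) if s > min_score}
        if kept:
            filtered[node] = kept
    return filtered
```

Several instructions usually compile from one source line, which is why scores are summed per line before non-leaking lines are dropped.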
[34:12.440 --> 34:18.200]  So that's the theory, but we haven't shown you yet whether it really works in practice.
[34:18.200 --> 34:24.480]  And before I do that, I want to emphasize three points about this project.
[34:24.480 --> 34:33.520]  The first one is, we spent a lot of time talking about explainability because if we don't have an extremely precise explainability technique,
[34:33.520 --> 34:43.980]  one that pinpoints exactly which point in the trace is responsible for the leakage, we're going to fail.
[34:43.980 --> 34:53.080]  We're going to fail because most instructions on ARM are two, maybe three, maybe four cycles long at most.
[34:53.080 --> 34:57.480]  Not at most, but one to two cycles for a standard addition.
[34:57.700 --> 35:00.420]  So it's literally four to eight points.
[35:00.660 --> 35:04.660]  Our margin of error is maybe one point, but that's about all we get.
[35:04.660 --> 35:10.520]  So if we don't have a precise mapping, if we are off by, let's say, 10 points, or if we have a window of 10 points,
[35:10.520 --> 35:13.520]  it's meaningless for us, because that might span three instructions.
[35:13.520 --> 35:16.340]  And those instructions might belong to different lines of code.
[35:16.340 --> 35:19.560]  So basically, the analysis is completely botched.
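The arithmetic behind "our margin of error is maybe one point" can be made concrete. Assuming, as in the talk, about four samples per cycle, a window of a given width overlaps a different number of cycles depending on how it is aligned. The helper names below are hypothetical:

```python
def cycles_touched(start: int, width: int, points_per_cycle: int = 4) -> int:
    """Number of CPU cycles overlapped by samples [start, start + width)."""
    first = start // points_per_cycle
    last = (start + width - 1) // points_per_cycle
    return last - first + 1


def worst_case_cycles(width: int, points_per_cycle: int = 4) -> int:
    """Most cycles a window of `width` samples can overlap, over all alignments."""
    return max(cycles_touched(offset, width, points_per_cycle)
               for offset in range(points_per_cycle))
```

Even a single-cycle window of 4 samples can straddle two cycles if it is misaligned, and `worst_case_cycles(10)` is 4, so with instructions of one or two cycles a 10-sample window can easily cover several instructions.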
[35:19.880 --> 35:26.020]  At the same time, we need an emulator that is also single-cycle precise,
[35:26.020 --> 35:31.460]  because if you do not take into account, let's say, pipeline flushes,
[35:31.460 --> 35:38.000]  if you don't take into account the CPU pipeline size and things like that, you get it wrong.
[35:38.040 --> 35:40.960]  You get it wrong because you're going to shift everything by, let's say,
[35:40.960 --> 35:46.580]  one cycle for each addition, and then your whole trace is completely shifted.
[35:46.580 --> 35:51.300]  And then even if your leakage mapping is good, you end up somewhere in the code that is not relevant.
[35:51.300 --> 35:57.380]  So you need a cycle-precise emulator and a single-point-precise explanation.
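Why a single mis-modeled cycle is so damaging: start times are cumulative sums of instruction costs, so an error on one instruction shifts every instruction after it. A toy illustration, assuming four samples per cycle and hypothetical cycle costs in which the real CPU spends two extra cycles on one instruction (for example, a pipeline refill after a branch) that the emulator model ignores:

```python
from typing import List, Sequence


def start_cycles(costs: Sequence[int]) -> List[int]:
    """Start cycle of each instruction, given per-instruction cycle costs."""
    starts, total = [], 0
    for c in costs:
        starts.append(total)
        total += c
    return starts


def mapping_error(true_costs: Sequence[int], modeled_costs: Sequence[int],
                  points_per_cycle: int = 4) -> List[int]:
    """Per-instruction misalignment, in trace samples, caused by cycle-model errors."""
    return [(t - m) * points_per_cycle
            for t, m in zip(start_cycles(true_costs), start_cycles(modeled_costs))]
```

`mapping_error([1, 3, 1, 1], [1, 1, 1, 1])` gives `[0, 0, 8, 8]`: the 8-sample (two-cycle) error never goes away, and if the model misses a cycle on every addition, the offsets keep growing along the trace.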
[35:57.520 --> 36:02.980]  So you need something extremely precise. This is very, very precise work.
[36:02.980 --> 36:06.140]  And then on top of that, you need a bit of computation.
[36:06.140 --> 36:11.000]  Again, as I said, you need to generate about a... we use, I think, one million data points,
[36:11.000 --> 36:16.500]  something like that. And then even for the explanation, we use 60,000 traces.
[36:17.080 --> 36:20.520]  You need 16 models. As I mentioned, you can shorten the time, of course,
[36:20.520 --> 36:23.200]  by using good models that converge very quickly.
[36:23.680 --> 36:27.500]  You need to generate 16 explanations for all those models on your traces,
[36:27.500 --> 36:29.040]  and then you need to map everything.
[36:29.460 --> 36:35.100]  So with our optimizations, that will take you a day or so of work.
[36:35.100 --> 36:39.860]  And, of course, as I said, most of it is parallelizable, because the 16 models
[36:39.860 --> 36:43.420]  and 16 explanations can be run on distinct CPUs.
[36:43.420 --> 36:47.260]  So you can make it up to 16 times faster.
[36:47.340 --> 36:52.660]  So that's really good, because we want to use this as a fast iterative tool for developers.
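Because each of the 16 per-byte models and its explanation are independent, the pipeline is embarrassingly parallel. A minimal sketch using Python's `concurrent.futures`; `job` stands for a hypothetical per-byte "train, then explain" function. Threads keep the sketch self-contained; CPU-heavy work in a real pipeline would more likely run as separate processes, GPUs, or machines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, TypeVar

T = TypeVar("T")
N_KEY_BYTES = 16  # one model, and one explanation, per AES-128 key byte


def run_per_byte(job: Callable[[int], T],
                 max_workers: int = N_KEY_BYTES) -> Dict[int, T]:
    """Run `job(byte_index)` for every key byte concurrently and collect results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(job, range(N_KEY_BYTES)))
    return dict(enumerate(results))
```

With 16 workers and jobs of similar length, wall-clock time approaches that of a single byte, which is where the "up to 16 times faster" figure comes from.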
[36:53.520 --> 36:58.600]  All right. So at this point, you're thinking, OK, he gave me a long speech about why it's really hard.
[36:58.600 --> 37:03.080]  So they probably failed. And it is really, really hard, because, yeah, emulating a CPU is hard.
[37:03.080 --> 37:07.180]  And I'm telling you it's true. I don't think we have perfect emulation of the CPU.
[37:07.180 --> 37:15.200]  I think we have very precise emulation of the instructions we need for AES.
[37:15.200 --> 37:23.080]  In particular, one thing we don't have, as a disclaimer, is mapping for division.
[37:23.260 --> 37:32.320]  Why? Because division takes a lot of cycles, and a variable number of cycles, on ARM CPUs.
[37:32.320 --> 37:39.240]  So we don't really know what to do there. So for now, in the interest of transparency, it only works for AES.
[37:39.440 --> 37:44.660]  That being said, if we try to apply what I said to our model,
[37:44.660 --> 37:51.140]  what we expect to see in theory is this: if the model is targeting the SubBytes input,
[37:51.140 --> 37:57.380]  then it must exploit something in AddRoundKey, because AddRoundKey is where
[37:57.380 --> 38:02.420]  the key is XORed with the plaintext. So the main leakage should be there.
[38:02.420 --> 38:09.260]  That's the theory. And so, to verify that what we built works, we actually check it.
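The theory here is that, in the first round, the SubBytes input for byte i is simply plaintext[i] XOR key[i], computed by AddRoundKey, so a model predicting that value should be relying on the moment the XOR result is handled. A small sketch of that target value, checked against the FIPS-197 Appendix B test vector; the Hamming-weight function is included only because it is a common leakage model in the side-channel literature, not because the talk says the model uses it.

```python
def sub_bytes_input(plaintext: bytes, key: bytes, byte_index: int) -> int:
    """First-round SubBytes input for one byte: the AddRoundKey output, p XOR k."""
    return plaintext[byte_index] ^ key[byte_index]


def hamming_weight(value: int) -> int:
    """Number of set bits; a common (assumed) power-leakage model."""
    return bin(value).count("1")


# FIPS-197 Appendix B test vector (AES-128 plaintext and cipher key).
plaintext = bytes.fromhex("3243f6a8885a308d313198a2e0370734")
key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
```

`[sub_bytes_input(plaintext, key, i) for i in range(4)]` gives `[0x19, 0x3d, 0xe3, 0xbe]`, the first bytes of the state after the initial AddRoundKey in that test vector.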
[38:09.520 --> 38:13.420]  And so that's what the SCALD output looks like. It's a terminal thing.
[38:13.420 --> 38:19.960]  You run it and at the end it spits out, as I said, a tree, which maps cycles to instructions,
[38:19.960 --> 38:25.400]  which are not displayed for readability, and instructions to code lines, which are basically numbers.
[38:25.400 --> 38:32.380]  And then I filter out everything that is not leaking. And what this mapping tells you is, yes,
[38:32.380 --> 38:37.860]  the main leakage is on line 213, in AddRoundKey. So that's promising.
[38:37.860 --> 38:44.500]  It also has, interestingly enough, a secondary leakage, which is later on in the Cipher function.
[38:46.660 --> 38:53.700]  So what is line 213, right? Well, the good news is it's exactly what we predicted.
[38:53.700 --> 38:59.920]  This is exactly the line in the code, which you can find on GitHub,
[38:59.920 --> 39:05.040]  where the key is XORed with the plaintext.
[39:05.240 --> 39:12.200]  So the model, and SCALD in general, are clearly using what theory would predict,
[39:12.200 --> 39:16.760]  which is that there is a leak on that specific line because it is doing...
[39:17.340 --> 39:22.340]  it is exposing the value of the register through an assignment.
[39:22.540 --> 39:25.700]  And so that's what it is. So that's a success.
[39:25.700 --> 39:32.480]  That's what gives us confidence that what we do works in practice. It gives us really interesting results.
[39:32.740 --> 39:38.360]  I will not claim it works in every case. I'm sure that with more complex implementations,
[39:38.360 --> 39:43.600]  like masked AES or more complicated protections, the results might be drastically different.
[39:43.600 --> 39:49.560]  As I said, our emulator does not fully emulate all the operations the ARM CPU can do.
[39:49.640 --> 39:54.180]  There is still a lot of uncertainty, but it works.
[39:54.380 --> 39:58.460]  I think it's a very promising step in the right direction of having tools
[39:58.460 --> 40:02.860]  that go back from leakage to the exact line of code.
[40:02.860 --> 40:06.020]  As far as we can tell, it's the first time this has ever been done.
[40:06.500 --> 40:10.120]  Back to the secondary leakage: I don't have a good explanation for it.
[40:10.120 --> 40:17.120]  The best I can come up with today is that it probably happens when some of the registers are unloaded.
[40:17.120 --> 40:22.760]  And this is where MixColumns starts, if you look at line 371.
[40:23.200 --> 40:28.480]  So maybe that's what happens: it's unloading some of the registers and then going back.
[40:28.480 --> 40:33.780]  We need a bit more analysis for that, but it will be interesting to know why there is another leakage.
[40:34.040 --> 40:39.560]  So hopefully today I showed you how we use SCALD to automatically isolate the leaking lines of code,
[40:39.560 --> 40:45.480]  how it works in practice, and concretely what the benefit of the tool is.
[40:45.780 --> 40:50.500]  And really, our hope is that we keep building this type of tool and get feedback from the community
[40:50.500 --> 40:57.760]  to build something that will empower people who develop secure hardware to quickly figure out and patch
[40:58.660 --> 41:04.240]  where the leakage is coming from, so we can develop stronger crypto more easily and with faster iteration,
[41:04.240 --> 41:08.300]  so we all benefit from stronger, more secure devices.
[41:09.480 --> 41:17.120]  And to wrap up the talk, as we discussed last year, machine learning is a way to automate
[41:17.720 --> 41:22.300]  side-channel attacks, and you reach state-of-the-art results with it, so it's really...
[41:22.880 --> 41:29.880]  it's at the forefront of side-channel attacks. This year we flipped the use case on its head,
[41:29.880 --> 41:33.620]  which was really the intent of the project from the onset two years ago:
[41:33.620 --> 41:39.880]  to use all that knowledge to actually help developers by building better tooling
[41:39.880 --> 41:44.780]  that reduces the cost of developing secure implementations and makes them better.
[41:45.180 --> 41:50.160]  So that being said, this is a very, very new and very exciting field, with a lot of ideas
[41:50.240 --> 41:54.700]  and a lot of energy around it, and it can really use more people interested in it.
[41:54.700 --> 41:59.380]  So if you have some interest in crypto or machine learning, it's a great time to get in
[41:59.380 --> 42:03.440]  and work with the community on these types of ideas.
[42:04.420 --> 42:08.080]  Thank you so much for attending this virtual talk. I wish it were in person.
[42:08.080 --> 42:12.260]  I'm going to miss DEF CON, as you probably do. I hope you're all well,
[42:12.260 --> 42:18.300]  and if you would like to follow up and keep up with what we're doing, we try to publish as fast as we can,
[42:18.740 --> 42:26.480]  modulo some delay, information about our side-channel attack work on the website.
[42:26.480 --> 42:30.220]  Hopefully we'll have an official project website in the future as well.
[42:30.220 --> 42:36.580]  Thank you so much for listening to this talk, and do not hesitate to reach out on Twitter
[42:36.580 --> 42:42.460]  or by email or any other means. Happy to answer questions. Thank you so much. Bye.
